Day 2 PM: Preparing data for analysis with tidyr

Visualization, modeling and inference in R is simplest when the data is collected into a single “tidy” data.frame. What constitutes a tidy data.frame depends somewhat on the context, but at a minimum it requires that each can be interpreted as an observation, each column as a variable, and each cell contains a value.

The original data set may vary from this ideal tidy format in several ways, and the tidyr package provides tools for us to convert data from messy to tidy. In particular, we show several common issues that we need to address to make data tidy. Here, we focus on three main verbs for tidying data - gather, spread and separate. We also briefly discuss what to do when the data is originally distributed over several files.

In [4]:
suppressPackageStartupMessages(library(tidyverse))
Warning message:
“Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11).
Please reinstall dplyr to avoid random crashes or undefined behavior.”Warning message:
“package ‘dplyr’ was built under R version 3.4.1”

Splitting columns with separate

One common issue is that the values in a single column are actually a combination of many variables. For example, there may be a “description” field that combines site ID,, patient ID and date. Tidy data requires that each column represents a single variable, and we need to separate the variables.

Make up a data set for illustration

In [4]:
sites <- paste("site", 1:10, sep="")
sites
  1. 'site1'
  2. 'site2'
  3. 'site3'
  4. 'site4'
  5. 'site5'
  6. 'site6'
  7. 'site7'
  8. 'site8'
  9. 'site9'
  10. 'site10'
In [5]:
subjects <- sample.int(100, 10)
subjects
  1. 64
  2. 7
  3. 96
  4. 28
  5. 89
  6. 69
  7. 29
  8. 65
  9. 51
  10. 18
In [6]:
dates <- sample(seq(as.Date('2016/01/01'),
                    as.Date('2017/01/01'), by="day"), 10)
dates
In [7]:
df <- data.frame(description=paste(sites, subjects, dates, sep=":"),
                 score=rnorm(10))
In [8]:
head(df)
descriptionscore
site1:64:2016-05-25-1.3108151
site2:7:2016-07-23 1.6269679
site3:96:2016-08-04 0.5706658
site4:28:2016-07-04-1.7594424
site5:89:2016-12-06 0.9366193
site6:69:2016-10-29 0.5264268

Using separate

In [9]:
df %>% separate(col=description, into=c("site", "pid", "date"), sep=":") %>% head
sitepiddatescore
site1 64 2016-05-25-1.3108151
site2 7 2016-07-23 1.6269679
site3 96 2016-08-04 0.5706658
site4 28 2016-07-04-1.7594424
site5 89 2016-12-06 0.9366193
site6 69 2016-10-29 0.5264268

When a single separator is not enough

In [10]:
df.1 <- data.frame(description=paste(sites, subjects, dates, sep="-"),
                   score=rnorm(10))
In [11]:
head(df.1)
descriptionscore
site1-64-2016-05-25 0.9018461
site2-7-2016-07-23 1.8470596
site3-96-2016-08-04 0.1700874
site4-28-2016-07-04-0.7663310
site5-89-2016-12-06-1.0308395
site6-69-2016-10-29 1.2056017

Using extract (advanced)

The verb extract is like separate, but instead of using a separator, it uses a regular expression to split strings. A crash course in regular expressions:

  • abc matches the characters ‘abc’
  • - matches the character ‘-‘
  • [abc] matches a or b or c
  • [a-z] matches any lower case letter
  • [0-9] matches any digit
  • . matches any single character
  • \\d matches any single digit
  • + matches one or more of the preceding character set
  • * matches zero or more of the preceding character set
  • {m, n} matches between m and n copies of the preceding character set
  • () indicates a capture group - the separated values desired
In [12]:
df.1 %>% extract(col=description,  into=c("site", "pid", "date"), regex="site([0-9]+)-([0-9]+)-(.*)") %>% head
sitepiddatescore
1 64 2016-05-25 0.9018461
2 7 2016-07-23 1.8470596
3 96 2016-08-04 0.1700874
4 28 2016-07-04-0.7663310
5 89 2016-12-06-1.0308395
6 69 2016-10-29 1.2056017

Separate rows

Less commonly, a single row contains multiple observations needs to be separated into multiple rows using separate_rows or extract_rows.

In [13]:
df.2 <- data.frame(pid=1:10, scores=replicate(10, paste(sample(1:10000, 3), collapse=",")))
In [14]:
head(df.2)
pidscores
1 5631,9112,2114
2 720,6652,4332
3 9810,243,853
4 8522,7479,9176
5 8715,7581,3025
6 4658,6293,9580
In [15]:
df.2 %>% separate_rows(col=scores, sep=",") %>% head
pidscores
1 5631
1 9112
1 2114
2 720
2 6652
2 4332

From columns to key-value pairs with gather

Sometimes a single variable is spread out over multiple columns. For example, we may wish to consider the Sepal.Length, Sepal.Width, Petal.Length and Petal.Width as variants of a single variable measure. The verb to use is gather which transforms a “wide” data.frame into a “tall” one.

For example, suppose we want ggplot2 to plot each measurement in a separate panel, colored by Species.

In [16]:
head(iris, n=3)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
In [17]:
iris %>% gather(key=measure, value = value, 1:4) %>% head(n=3)
Speciesmeasurevalue
setosa Sepal.Length5.1
setosa Sepal.Length4.9
setosa Sepal.Length4.7

Alternative ways of specifying columns to gather

In [18]:
iris %>% gather(measure, value, 1:4) %>% head(n=3)
Speciesmeasurevalue
setosa Sepal.Length5.1
setosa Sepal.Length4.9
setosa Sepal.Length4.7
In [19]:
iris %>% gather(measure, value, c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)) %>% head(n=3)
Speciesmeasurevalue
setosa Sepal.Length5.1
setosa Sepal.Length4.9
setosa Sepal.Length4.7
In [20]:
iris %>% gather(measure, value, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>% head(n=3)
Speciesmeasurevalue
setosa Sepal.Length5.1
setosa Sepal.Length4.9
setosa Sepal.Length4.7
In [21]:
iris %>% gather(measure, value, -Species) %>% head(n=3)
Speciesmeasurevalue
setosa Sepal.Length5.1
setosa Sepal.Length4.9
setosa Sepal.Length4.7
In [22]:
iris %>% gather(measure, value, -5) %>% head(n=3)
Speciesmeasurevalue
setosa Sepal.Length5.1
setosa Sepal.Length4.9
setosa Sepal.Length4.7

Use of gather for plotting

In [23]:
df.3 <- iris %>% gather(measure, value, -Species)
ggplot(df.3, aes(x=value, color=Species, fill=Species)) +
geom_density(alpha=0.5) +
facet_wrap(~ measure, scales = "free")
Data type cannot be displayed:
../_images/notebooks_Preparing_data_for_analysis_with_tidyr_30_1.png

From key-value pairs to columns with spread

The verb spread is the reverse of gather - it makes a “tall” data.frame into a “wide” one. However, it requires that each row have a unique identifier to do so. If we just try to apply spread to the tall version of iris, it will fail because there are 50 of each Species. We therefore need to generate a new column to make each row have a unique identifier.

In [24]:
head(df.3)
Speciesmeasurevalue
setosa Sepal.Length5.1
setosa Sepal.Length4.9
setosa Sepal.Length4.7
setosa Sepal.Length4.6
setosa Sepal.Length5.0
setosa Sepal.Length5.4
In [25]:
df.3$n <- rep(1:50, 12)
In [26]:
iris.1 <- df.3 %>% spread(measure, value)
head(iris.1)
SpeciesnPetal.LengthPetal.WidthSepal.LengthSepal.Width
setosa1 1.4 0.2 5.1 3.5
setosa2 1.4 0.2 4.9 3.0
setosa3 1.3 0.2 4.7 3.2
setosa4 1.5 0.2 4.6 3.1
setosa5 1.4 0.2 5.0 3.6
setosa6 1.7 0.4 5.4 3.9
In [27]:
iris.2 <- iris.1[, -2]
head(iris.2)
SpeciesPetal.LengthPetal.WidthSepal.LengthSepal.Width
setosa1.4 0.2 5.1 3.5
setosa1.4 0.2 4.9 3.0
setosa1.3 0.2 4.7 3.2
setosa1.5 0.2 4.6 3.1
setosa1.4 0.2 5.0 3.6
setosa1.7 0.4 5.4 3.9
In [28]:
df.3 <- df.3[,-4] # remove n to restore oroignal tall format
In [29]:
df.3 %>%
mutate(n=rep(1:50,12)) %>%
spread(measure, value) %>%
select(-n) %>%
head
SpeciesPetal.LengthPetal.WidthSepal.LengthSepal.Width
setosa1.4 0.2 5.1 3.5
setosa1.4 0.2 4.9 3.0
setosa1.3 0.2 4.7 3.2
setosa1.5 0.2 4.6 3.1
setosa1.4 0.2 5.0 3.6
setosa1.7 0.4 5.4 3.9

When variables are stored in both rows and columns (advanced)

In the toy example below, we measure peak and trough levels of something (blood glucose level, viral titers etc) recorded between visits. We convert this into a tidy data frame in two steps:

  1. Use tidyr::gather to create a new column variable visit
  2. Use tidyr::spread to create peak and trough column variables from each row of measure
In [2]:
subject <- c("Ann", "Ann", "Bob", "Bob", "Charlie", "Charlie")
measure <- c("peak", "trough", "peak", "trough", "peak", "trough")
visit1 <- c(250, 125, 1000, 750, 1500, 250)
visit2 <- c(500, 125, 900, 650, 1550, 500)
df.4 <- data.frame(subject, measure, visit1, visit2)
df.4
subjectmeasurevisit1visit2
Ann peak 250 500
Ann trough 125 125
Bob peak 1000 900
Bob trough 750 650
Charliepeak 1500 1550
Charlietrough 250 500

First gather the visit values ioto a single column

In [7]:
df.5 <- df.4 %>% gather(visit, value, visit1, visit2)
df.5
subjectmeasurevisitvalue
Ann peak visit1 250
Ann trough visit1 125
Bob peak visit1 1000
Bob trough visit1 750
Charliepeak visit1 1500
Charlietrough visit1 250
Ann peak visit2 500
Ann trough visit2 125
Bob peak visit2 900
Bob trough visit2 650
Charliepeak visit2 1550
Charlietrough visit2 500

Then spread the peak and trough measurements

In [8]:
df.5 %>% spread(measure, value)
subjectvisitpeaktrough
Ann visit1 250 125
Ann visit2 500 125
Bob visit1 1000 750
Bob visit2 900 650
Charlievisit1 1500 250
Charlievisit2 1550 500

In one fluent command

In [10]:
df.4 %>%
gather(visit, value, visit1, visit2) %>%
spread(measure, value)
subjectvisitpeaktrough
Ann visit1 250 125
Ann visit2 500 125
Bob visit1 1000 750
Bob visit2 900 650
Charlievisit1 1500 250
Charlievisit2 1550 500

Working with missing data

Sometimes cells contain missing values. Here we show the simplest way to deal with this common scenario. The approach shown is not always appropriate - consult a statistician if in doubt.

In [30]:
iris.missing <- iris
iris.missing[2,3] <- NA
iris.missing[4:6, 1] <- NA
In [31]:
head(iris.missing)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
5.1 3.5 1.4 0.2 setosa
4.9 3.0 NA 0.2 setosa
4.7 3.2 1.3 0.2 setosa
NA 3.1 1.5 0.2 setosa
NA 3.6 1.4 0.2 setosa
NA 3.9 1.7 0.4 setosa

Drop any row with missing data

This is known as complete case analysis, and is appropriate when you have abundant observations and the missing values are believed to be missing at random.

In [32]:
iris.missing %>% drop_na %>% head
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.1 3.5 1.4 0.2 setosa
34.7 3.2 1.3 0.2 setosa
74.6 3.4 1.4 0.3 setosa
85.0 3.4 1.5 0.2 setosa
94.4 2.9 1.4 0.2 setosa
104.9 3.1 1.5 0.1 setosa

Combining data with join (part of dplyr)

Sometimes data is distributed over many files. In the simplest case, each data set has exactly the same format as the others, and we just need to append additional rows. At other times, each data set contains some different variable (e.g. merging clinical and assay data) and we need to match rows according to some unique row identifier.

When each data.frame is similar

In [33]:
levels(iris$Species)
  1. 'setosa'
  2. 'versicolor'
  3. 'virginica'
In [34]:
setosa <- iris %>% filter(Species=="setosa")
versicolor <- iris %>% filter(Species=="versicolor")
virginica <- iris %>% filter(Species=="virginica")
In [35]:
dim(setosa)
  1. 50
  2. 5
In [36]:
head(setosa, 3)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
In [37]:
dim(versicolor)
  1. 50
  2. 5
In [38]:
head(versicolor, 3)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
In [39]:
dim(virginica)
  1. 50
  2. 5
In [40]:
head(virginica, 3)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
In [41]:
iris.combined <- bind_rows(setosa, versicolor, virginica)
In [42]:
dim(iris.combined)
  1. 150
  2. 5
In [43]:
iris.combined %>% sample_n(10)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
434.4 3.2 1.3 0.2 setosa
1296.4 2.8 5.6 2.1 virginica
265.0 3.0 1.6 0.2 setosa
155.8 4.0 1.2 0.2 setosa
34.7 3.2 1.3 0.2 setosa
74.6 3.4 1.4 0.3 setosa
1456.7 3.3 5.7 2.5 virginica
1317.4 2.8 6.1 1.9 virginica
855.4 3.0 4.5 1.5 versicolor
285.2 3.5 1.5 0.2 setosa

When the rows are perfectly aligned

In [13]:
lengths <- iris %>% select(Species, Sepal.Length, Petal.Length)
head(lengths, 3)
SpeciesSepal.LengthPetal.Length
setosa5.1 1.4
setosa4.9 1.4
setosa4.7 1.3
In [14]:
widths <- iris %>% select(Sepal.Width, Petal.Width)
head(widths, 3)
Sepal.WidthPetal.Width
3.50.2
3.00.2
3.20.2
In [16]:
lengths %>% bind_cols(widths) %>% head
SpeciesSepal.LengthPetal.LengthSepal.WidthPetal.Width
setosa5.1 1.4 3.5 0.2
setosa4.9 1.4 3.0 0.2
setosa4.7 1.3 3.2 0.2
setosa4.6 1.5 3.1 0.2
setosa5.0 1.4 3.6 0.2
setosa5.4 1.7 3.9 0.4

When data frames are different

In [44]:
df.clin <- data.frame(site.id=c('A', 'A', 'A', 'B', 'B', 'B'),
                      pt.id=c(100, 101, 102, 100, 102, 105),
                      age=c(38, 35, 43, 50, 62, 50))
df.clin
site.idpt.idage
A 10038
A 10135
A 10243
B 10050
B 10262
B 10550
In [45]:
df.lab <- data.frame(site.id=c('B', 'B', 'B', 'A', 'A', 'A'),
                     pt.id=c(103, 100, 105, 100, 101, 102),
                     values=sample(100:1000, 6))
df.lab
site.idpt.idvalues
B 103872
B 100623
B 105307
A 100325
A 101651
A 102255

Keep only elements in common

In [46]:
df.clin %>% inner_join(df.lab, by = c("site.id", "pt.id"))
site.idpt.idagevalues
A 10038 325
A 10135 651
A 10243 255
B 10050 623
B 10550 307

Keep all rows in df.clin

In [47]:
df.clin %>% left_join(df.lab, by = c("site.id", "pt.id"))
site.idpt.idagevalues
A 10038 325
A 10135 651
A 10243 255
B 10050 623
B 10262 NA
B 10550 307

Keep all rows in df.lab

In [48]:
df.clin %>% right_join(df.lab, by = c("site.id", "pt.id"))
site.idpt.idagevalues
B 103NA 872
B 10050 623
B 10550 307
A 10038 325
A 10135 651
A 10243 255

Keep all rows in df.clin or df.lab

In [49]:
df.clin %>% full_join(df.lab, by = c("site.id", "pt.id"))
site.idpt.idagevalues
A 10038 325
A 10135 651
A 10243 255
B 10050 623
B 10262 NA
B 10550 307
B 103NA 872

When identifier columns have different names

In [50]:
df.lab.1 <- data.frame(site=c('B', 'B', 'B', 'A', 'A', 'A'),
                     pid=c(103, 100, 105, 100, 101, 102),
                     values=sample(100:1000, 6))
df.lab.1
sitepidvalues
B 103625
B 100517
B 105736
A 100739
A 101165
A 102944

Keep only elements in common

In [51]:
df.clin %>% inner_join(df.lab.1, by = c("site.id"="site", "pt.id"="pid"))
site.idpt.idagevalues
A 10038 739
A 10135 165
A 10243 944
B 10050 517
B 10550 736

When identifiers are in the row.names (advanced)

In [52]:
df.lab.2 <- data.frame(values=sample(100:1000, 6))
rownames(df.lab.2) <- paste(c('B', 'B', 'B', 'A', 'A', 'A'),
                           c(103, 100, 105, 100, 101, 102), sep="-")
df.lab.2
values
B-103963
B-100556
B-105620
A-100335
A-101886
A-102115
In [53]:
df.lab.2 %>%
    rownames_to_column %>%
  separate(rowname, into=c("site.id", "pt.id"), sep="-")
site.idpt.idvalues
B 103963
B 100556
B 105620
A 100335
A 101886
A 102115
In [54]:
df.clin %>% inner_join(
    # create new data.frame with matching identifier columns
    df.lab.2 %>%
    rownames_to_column %>%
    separate(rowname, into=c("site.id", "pt.id"), sep="-") %>%
    mutate(site.id=as.factor(site.id), pt.id=as.numeric(pt.id)),
    # specify columns to join on
    by=c("site.id", "pt.id")
)
site.idpt.idagevalues
A 10038 335
A 10135 886
A 10243 115
B 10050 556
B 10550 620
In [ ]: